Goal for today: train, build, and evaluate random forest models using data from the student survey.
What we’ll do:
The most important packages in this session will be the
rpart package for the decision trees as well as
ranger for the random forest models.
We will use some functions from the ggplot2 package that
we have not yet introduced. We will have two sessions on this package
later.
Decision trees are simple yet powerful tools that recursively split data
based on the predictor values to make predictions. They extract useful
and easy-to-interpret information, even from large data sets and they
have been widely used in several disciplines. Both discrete and
continuous variables can be used either as dependent variables or
predictors.
We will create two examples: one for a classification task (i.e. when the dependent variable is binary or categorical) and one for a regression task (i.e. when the dependent variable is continuous). For that, we use the student survey:
1. We classify whether a student is risk affine or risk
averse based on student characteristics. The
risk_affine variable is binary and takes on the value 1 if
a student has a higher-than-average risk affinity score.
2. We predict the expected exam grade
statistiknote_w1 based on student characteristics.
library(rpart)
library(rpart.plot)
library(caret)
library(tidyverse)
library(dplyr)
library(ggplot2)
library(patchwork)
To be able to dive directly into the machine learning topic, we cut the data preparation short and drop all missing observations in the variables that we want to use.
data <- read.csv("../00_data/survey/survey_processed.csv")
data <- data %>%
select(-c(id, has_siblings, avg_mathenote, einkommen, age, risiko, geburtsbundesland_name)) %>%
drop_na() %>% # just for illustrational purposes, never just drop missings!
mutate_at(c("female", "risk_affine", "alkohol", "sport", "sidejob", "organ_donor", "erstfach_name"), as.factor)
Decision trees use a greedy algorithm for constructing the tree, meaning that at each step, the algorithm makes the optimal split decision based only on the current state of the tree, without considering the impact of future splits. This local optimization improves computational efficiency, but may result in a suboptimal overall tree structure.
A decision tree builds its structure through binary partitions. At each decision node, the data is split into two parts based on a feature and a threshold value. This is done repeatedly, creating a set of decision rules that lead to the terminal leaf nodes. Each leaf node contains a subset of the data that is as similar as possible based on the outcome variable.
The objective of a decision tree is to minimize the dissimilarity or impurity of the observations in each leaf node. In a classification problem, dissimilarity can be measured using metrics such as Gini impurity or entropy. For regression tasks, the dissimilarity is typically measured by the variance within the leaf node.
Classification trees minimize (by default) the Gini impurity. This is a measure of how often a randomly chosen element from the set would be incorrectly classified if it were labeled according to the distribution of labels in the set.
\[ \begin{aligned} Gini = 1 - \sum_{i} p_i^2, \end{aligned} \] where \(p_i\) is the proportion of elements in the node that belong to class \(i \in \{1,2\}\).
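For intuition, the Gini impurity of a single node can be computed by hand. The helper below is purely illustrative (it is not how rpart evaluates splits internally): a node with 30 observations of class 1 and 10 of class 2 has \(Gini = 1 - (0.75^2 + 0.25^2) = 0.375\).

```r
# Gini impurity of a vector of class labels (illustrative helper)
gini_impurity <- function(labels) {
  p <- table(labels) / length(labels)  # class proportions in the node
  1 - sum(p^2)
}

node <- c(rep(1, 30), rep(2, 10))  # 30 obs of class 1, 10 of class 2
gini_impurity(node)                # 1 - (0.75^2 + 0.25^2) = 0.375
```

A pure node (all labels equal) has impurity 0; a 50/50 node reaches the maximum of 0.5 for two classes.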
For regression tasks, decision trees use variance reduction to evaluate splits. Variance measures how much the target values deviate from the mean. Splitting the data should reduce this variance, meaning the outcome values within each subset are more consistent.
When a split occurs, the variance reduction is defined as:
\[ \begin{aligned} Variance\ Reduction = Var(S) - \left( \frac{|S_1|}{|S|} Var(S_1) + \frac{|S_2|}{|S|} Var(S_2) \right), \end{aligned} \] where \(S_1\) and \(S_2\) are the two subsets created by the split, and \(\vert S \vert\) represents the number of instances.
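The formula can be checked on a toy vector (an illustrative helper, not the rpart internals): splitting the six values below at 5 separates the two clusters and removes almost all of the variance.

```r
# Variance reduction of a candidate split (illustrative helper)
variance_reduction <- function(y, left_idx) {
  v <- function(z) mean((z - mean(z))^2)  # population variance of a node
  y_left  <- y[left_idx]
  y_right <- y[!left_idx]
  v(y) - (length(y_left)  / length(y) * v(y_left) +
          length(y_right) / length(y) * v(y_right))
}

y <- c(1, 2, 3, 10, 11, 12)
variance_reduction(y, left_idx = y < 5)  # Var(S) = 125.5/6, children each 2/3 -> reduction = 20.25
```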
We build an example tree using the risk_affine variable
as our target. The goal is to construct a tree that classifies whether a
student is risk affine (risk_affine = 1) or risk averse
(risk_affine = 0), based on selected predictor variables.
We use the rpart() function to build the tree and specify
that we want a classification tree.
# Classification decision tree
class_tree <- rpart(risk_affine ~ female + statistiknote_w1 +
fachsemester + erstfach_name +
alkohol, data = data, method = "class")
Using the rpart.plot() function, we can visualize the tree and see which predictor was chosen for each split and each cutoff threshold.
What we learn from the plot:
Each leaf shows the predicted class (risk_affine = 0 or risk_affine = 1), which is also reflected by the colors. The second row in the circles shows the predicted class probabilities.
The splits are displayed such that the yes is on the left side and the no on the right side. We plot the same graph using a different type for the visualization.
# Plot the decision tree with explicit splits
rpart.plot(class_tree, main = "Decision Tree: Classification", type = 5)
In the next example, we build a regression tree, which means that our
target variable is continuous. The goal is to model the expected exam
grade statistiknote_w1 as a function of selected predictor
variables. The tree then seeks to minimize the variance in the target
variable for each leaf node, ensuring that observations in each leaf are
as similar as possible. We specify the method to be
anova.
# Regression decision tree
reg_tree <- rpart(statistiknote_w1 ~ female + risk_affine + fachsemester +
erstfach_name + alkohol,
data = data, method = "anova")
Again, we can plot our decision tree.
What we learn from the plot:
How deep should trees be grown?
Suppose we have a continuous target variable \(y\) and a predictor variable \(x\), and we know the underlying relationship between them, meaning we have knowledge of the true function that connects them. We also have observations for both variables. In practice, we typically start with the observed data for both variables without knowing this relationship in advance.
To explore the impact of tree depth, we generate synthetic data for visualization purposes. Our aim is to understand the effect of tree complexity.
# Set seed for reproducibility
set.seed(123)
# Generate some synthetic data
n <- 300 # Number of observations
x <- runif(n, 0, 6) # random uniform points between 0 and 6
y <- sin(x) + rnorm(n, 0, 0.3) # underlying function plus noise
# Create a data frame called "data2"
data2 <- data.frame(x = x, y = y)
# Create a data frame for the true underlying function
x_seq <- seq(0, 6, length.out = 300)
y_true <- sin(x_seq)
# Plot
ggplot(data2, aes(x = x, y = y)) +
geom_point(alpha = 0.5, color = "grey") + # observations
geom_line(aes(x = x_seq, y = y_true), color = "blue", linewidth = 1) + # true function
labs(x = "x", y = "y") +
theme_minimal()
We’ll begin with a simple tree model that has just one split. Our
goal is to predict the variable \(y\)
based on the variable \(x\), aiming to
approximate the underlying blue line. By using the control argument
along with the rpart.control() function, we can define
various stopping criteria for tree growth. Setting
maxdepth = 1 allows us to create a decision tree that has
no more than one split.
# Fit a decision tree with one split
tree <- rpart(y ~ x, data = data2, control = rpart.control(maxdepth = 1))
# Predict the fitted values from the tree
data2$y_pred <- predict(tree)
# Plot the tree with rpart.plot
rpart.plot(tree)
Let’s plot the result.
# Plot
ggplot(data2, aes(x = x, y = y)) +
geom_point(alpha = 0.5, color = "grey") + # observations
geom_line(aes(x = x_seq, y = y_true), color = "blue", linewidth = 1) + # true function
geom_step(aes(y = y_pred), color = "red", linewidth = 1) + # decision tree prediction
geom_vline(xintercept = tree$splits[1, "index"], linetype = "dashed") + # tree split
annotate("text", x = tree$splits[1, "index"], y = -1, label = "split", angle = 90, vjust = -1) +
labs(x = "x", y = "y") +
theme_minimal()
We repeat the process, this time allowing the tree to grow to a maximum depth of three.
# Fit a decision tree with a maximum depth of three
tree <- rpart(y ~ x, data = data2, control = rpart.control(maxdepth = 3))
# Predict the fitted values from the tree
data2$y_pred <- predict(tree)
# Plot the tree with rpart.plot
rpart.plot(tree)
# Plot
ggplot(data2, aes(x = x, y = y)) +
geom_point(alpha = 0.5, color = "grey") + # observations
geom_line(aes(x = x_seq, y = y_true), color = "blue", linewidth = 1) + # true function
geom_step(aes(y = y_pred), color = "red", linewidth = 1) + # decision tree prediction
labs(x = "x", y = "y") +
theme_minimal()
The more complex the tree, the more steps we create in the red function. At some point, the red line starts to overfit the observations instead of capturing the underlying blue function.
Instead of restricting the depth of the tree, we can also set a minimum number of observations required in each leaf node. This prevents the tree from making further splits when the number of observations is too small, helping to reduce the risk of overfitting the data. We will learn more about this in the random forest section.
Summary
Before we go there, we have one last concept to cover.
Bagging, short for bootstrap aggregating, involves combining multiple classifiers, in this case decision trees, to enhance prediction accuracy and robustness. The process begins by creating several training datasets from the original data using bootstrapping, where random samples are taken with replacement. In the context of a random forest, this results in multiple decision trees, each trained on a different bootstrapped sample. The final prediction is obtained by aggregating the outputs—either by averaging in regression tasks or through majority voting in classification. By leveraging multiple trees, bagging helps reduce variance, mitigate overfitting, and improve overall model performance.
Using the same artificial example as before, we can look at the relationship between the number of decision trees in the ensemble and the prediction error. The graph shows that the prediction becomes better (closer to the blue line) with an increasing amount of trees in the ensemble.
However, one last problem remains! The trees in the bootstrap aggregating algorithm are highly correlated, i.e. they all look very similar. If the individual trees in the ensemble are highly correlated, the benefits of averaging are diminished because the trees make similar predictions on new data.
num_trees <- 10
# Fit multiple decision trees using lapply and bootstrap sampling
set.seed(123) # Set a seed for reproducibility
trees <- lapply(1:num_trees, function(i) {
# Create a bootstrap sample from the data (sampling with replacement)
bootstrap_sample <- data2[sample(1:nrow(data2), replace = TRUE), ]
# Fit a decision tree to the bootstrap sample
rpart(y ~ x, data = bootstrap_sample, model = TRUE, control = rpart.control(maxdepth = 6))
})
# Plot all trees in a grid
par(mfrow = c(2, 5)) # 2 rows, 5 columns
for (i in 1:num_trees) {
rpart.plot(trees[[i]], main = paste("Tree", i))
}
This correlation among trees can limit the diversity of the ensemble, making it less effective at capturing different aspects of the data and potentially resulting in suboptimal performance. To address this issue, the random forest algorithm builds on bagging by introducing additional randomness through predictor selection.
The random forest algorithm extends the concept of bagging by introducing additional randomness. Each tree is trained on a different bootstrap sample of the data, and during tree construction, a random subset of predictors is considered at each split. This additional layer of randomness helps in creating a more diverse set of trees, thereby improving the robustness and generalization of the ensemble model.
Each time a split is performed, the search for the optimal split
variable is limited to a random subset of \(m\) of all \(p\) predictors. This parameter is often
referred to as mtry = m.
Every time a bootstrap sample is constructed, there are observations
that get left out. These out-of-bag (OOB) observations are not used
to construct the tree, but can be used to assess the optimal parameters of
the model. For the OOB error to be valid, the number of observations
\(N\) should be sufficiently large.
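On average, a bootstrap sample of size \(N\) leaves out roughly a third of the observations, since the probability that a given observation is never drawn is \((1 - 1/N)^N \approx e^{-1} \approx 0.368\). A quick base-R simulation (a sketch for intuition, not part of the ranger workflow) confirms this:

```r
# Fraction of observations left out of a bootstrap sample (out-of-bag)
set.seed(123)
n <- 10000
oob_frac <- replicate(100, {
  drawn <- sample(1:n, replace = TRUE)  # bootstrap sample of size n
  mean(!(1:n %in% drawn))               # share of observations never drawn
})
mean(oob_frac)  # close to exp(-1) ~ 0.368
```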
Besides mtry, there are other parameters that can be tuned
to increase the performance of the algorithm. The most important ones
are:
mtry: Number of predictors to be considered at each
split.
ntree: The number of trees in the random forest.
min.node.size: The minimal number of observations in a
leaf node.
In the following, we build a classification random forest using the
same example as before. Our goal is again to classify whether a student
is risk affine or risk averse. There are many R packages that include
the random forest algorithm. We will use the ranger package
as this is one of the most common ones. The first step is to tune the
parameters mtry, ntree and
min.node.size. To that end, we divide our student survey in
a training and a test set.
There are several packages available for parameter tuning in random
forests, and you can also write your own code for specific training
schemes. In this session, we will focus on the caret
package, as we did with LASSO. However, caret only allows
tuning of the mtry parameter, which significantly impacts
the model’s final accuracy. The ntree parameter is
different in that it can be as large as you like and continues to
increase accuracy up to some point. It is less critical to tune and is
limited mostly by the available compute time. Therefore, we will first ensure that we have an
adequate number of trees and then proceed to tune the mtry
and min.node.size parameters.
library(ranger)
library(caret)
set.seed(123) # For reproducibility
# Create a training and test dataset
trainIndex <- createDataPartition(data$risk_affine, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
dim(trainData)
## [1] 124 27
dim(testData)
## [1] 51 27
Next, we will use the default settings for mtry and
min.node.size to examine how accuracy changes as we
increase the number of trees. To do this, we first define a grid with
the different numbers of trees we want to test. Then, we use a
for loop to fit a random forest model with the
ranger() function for each value of ntree, and
we record the resulting Root Mean Squared Error (RMSE) or prediction
error for each tree count.
# Creates a data frame (tuning_grid) where trees ranges from 5 to 400 in
# steps of 5. The rmse column is initialized with NA to be filled later.
tuning_grid <- expand.grid(
trees = seq(5, 400, by = 5),
rmse = NA
)
for(i in seq_len(nrow(tuning_grid))) { # Iterates over each row of the tuning_grid.
# Fits a random forest model with the current number of trees specified by
# tuning_grid$trees[i], and mtry set to the square root of 5.
# verbose = FALSE suppresses detailed output, and
# seed = 123 ensures reproducibility.
fit <- ranger(
formula = risk_affine ~ female + statistiknote_w1 + fachsemester +
erstfach_name + alkohol,
data = trainData,
num.trees = tuning_grid$trees[i],
mtry = sqrt(5),
verbose = FALSE,
seed = 123
)
# Stores the RMSE of the model in the rmse column of tuning_grid for the
# corresponding number of trees. The RMSE is calculated as the square root
# of the fit$prediction.error.
tuning_grid$rmse[i] <- sqrt(fit$prediction.error)
}
Now we can plot the resulting RMSE values against the number of trees in the random forest.
We don’t see a very clear pattern, but after about 150 trees, the prediction error does not change much anymore. Notice that the value range on the y-axis is very narrow.
Now, we select a sufficient number of trees and fix
ntree\(=150\).
Using again the cross-validation approach included in the
caret package, we tune the remaining parameter
mtry and min.node.size.
# Defines the training scheme to be 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
We specify the values to be considered and also define the
splitrule, as this is required by the train function when
training a ranger object.
tune_grid <- expand.grid(
mtry = seq(2,4,0.05), # Number of variables randomly sampled at each split
splitrule = "gini", # Split rule used for classification
min.node.size = c(1, 5, 10) # Minimum size of terminal nodes
)
tune_grid
Using the train() function and specifying the method to
be ranger the tuning can begin. The output is the model
with the preferred combination of parameters.
# Train the Random Forest model using ranger in caret
set.seed(123) # Set seed for reproducibility
rf_model <- train(
risk_affine ~ female + statistiknote_w1 + fachsemester +
erstfach_name + alkohol,
data = trainData,
method = "ranger",
trControl = train_control,
tuneGrid = tune_grid,
num.trees = 20
)
# Print the best model and results
print(rf_model)
## Random Forest
##
## 124 samples
## 5 predictor
## 2 classes: '0', '1'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 111, 111, 112, 113, 111, 112, ...
## Resampling results across tuning parameters:
##
## mtry min.node.size Accuracy Kappa
## 2.00 1 0.5353147 0.038247395
## 2.00 5 0.4998252 -0.048408334
## 2.00 10 0.5173660 -0.008669779
## 2.05 1 0.5173660 0.007045028
## 2.05 5 0.5786713 0.116317107
## 2.05 10 0.5480186 0.072561258
## 2.10 1 0.5089161 -0.016741311
## 2.10 5 0.5417249 0.056412909
## 2.10 10 0.5912005 0.156419645
## 2.15 1 0.5005828 -0.027793930
## 2.15 5 0.5258159 0.026646838
## 2.15 10 0.4936480 -0.046968118
## 2.20 1 0.5403263 0.044044375
## 2.20 5 0.5486597 0.068795382
## 2.20 10 0.5012238 -0.030632103
## 2.25 1 0.5486597 0.061825088
## 2.25 5 0.5327506 0.041647222
## 2.25 10 0.5082751 -0.018019514
## 2.30 1 0.5417249 0.037419756
## 2.30 5 0.4511072 -0.124584572
## 2.30 10 0.5263403 0.020366886
## 2.35 1 0.5312354 0.020743172
## 2.35 5 0.5180070 0.001623293
## 2.35 10 0.5423660 0.039574137
## 2.40 1 0.5340326 0.023349083
## 2.40 5 0.5737762 0.111334072
## 2.40 10 0.5494172 0.067797882
## 2.45 1 0.5660839 0.105256057
## 2.45 5 0.5090326 -0.006380535
## 2.45 10 0.5321096 0.029400948
## 2.50 1 0.4930070 -0.050259918
## 2.50 5 0.5423660 0.039937446
## 2.50 10 0.5167249 0.003421626
## 2.55 1 0.5494172 0.070777828
## 2.55 5 0.5256993 0.015250634
## 2.55 10 0.5591492 0.082823398
## 2.60 1 0.5646853 0.096249274
## 2.60 5 0.5276224 0.027721105
## 2.60 10 0.5243007 0.030970772
## 2.65 1 0.5090326 -0.009392186
## 2.65 5 0.5173660 0.008792534
## 2.65 10 0.5063520 -0.021210028
## 2.70 1 0.4914918 -0.046251081
## 2.70 5 0.5724942 0.115815367
## 2.70 10 0.5481352 0.060368283
## 2.75 1 0.5249417 0.022856858
## 2.75 5 0.5660839 0.087902463
## 2.75 10 0.5403263 0.041468077
## 2.80 1 0.5326340 0.034627960
## 2.80 5 0.5487762 0.062416753
## 2.80 10 0.5013403 -0.041350335
## 2.85 1 0.5346737 0.038433117
## 2.85 5 0.5243007 0.018768697
## 2.85 10 0.5648019 0.098730519
## 2.90 1 0.5481352 0.058530758
## 2.90 5 0.5745338 0.111178548
## 2.90 10 0.5256993 0.019401510
## 2.95 1 0.5416084 0.046399733
## 2.95 5 0.5417249 0.041298272
## 2.95 10 0.5814685 0.133864359
## 3.00 1 0.4685315 -0.077121685
## 3.00 5 0.4858392 -0.047311678
## 3.00 10 0.4826340 -0.051146910
## 3.05 1 0.4949301 -0.038642435
## 3.05 5 0.4859557 -0.042792306
## 3.05 10 0.5040210 -0.010269699
## 3.10 1 0.5180070 0.015945717
## 3.10 5 0.5263403 0.028150489
## 3.10 10 0.4928904 -0.038302016
## 3.15 1 0.5152098 0.001358180
## 3.15 5 0.5712121 0.115765787
## 3.15 10 0.4865967 -0.041970646
## 3.20 1 0.4442890 -0.117301904
## 3.20 5 0.5396853 0.060514090
## 3.20 10 0.5205711 0.015028526
## 3.25 1 0.5173660 0.023093805
## 3.25 5 0.5243007 0.028150735
## 3.25 10 0.4942890 -0.034830609
## 3.30 1 0.5417249 0.066407914
## 3.30 5 0.4699301 -0.077319822
## 3.30 10 0.5096737 -0.004303981
## 3.35 1 0.5101981 -0.001538353
## 3.35 5 0.5090326 -0.004828861
## 3.35 10 0.5020979 -0.011785692
## 3.40 1 0.4691725 -0.078429527
## 3.40 5 0.5173660 0.007434506
## 3.40 10 0.5328671 0.040157955
## 3.45 1 0.4851981 -0.056799013
## 3.45 5 0.4782634 -0.064480800
## 3.45 10 0.5090326 -0.004668811
## 3.50 1 0.5094406 -0.009179617
## 3.50 5 0.5654429 0.103605868
## 3.50 10 0.5340326 0.034933613
## 3.55 1 0.5403263 0.060221785
## 3.55 5 0.4949301 -0.037878973
## 3.55 10 0.5032634 -0.026518827
## 3.60 1 0.4853147 -0.052206506
## 3.60 5 0.4773893 -0.070096587
## 3.60 10 0.5153263 0.019183804
## 3.65 1 0.5346737 0.044224868
## 3.65 5 0.5396853 0.052861993
## 3.65 10 0.4398019 -0.141594862
## 3.70 1 0.4865967 -0.057733357
## 3.70 5 0.4768648 -0.058828102
## 3.70 10 0.4928904 -0.042980324
## 3.75 1 0.4949301 -0.034929008
## 3.75 5 0.5117133 0.007408821
## 3.75 10 0.4928904 -0.029092948
## 3.80 1 0.4941725 -0.034709275
## 3.80 5 0.4748252 -0.077294320
## 3.80 10 0.5090326 -0.008105402
## 3.85 1 0.4853147 -0.049116005
## 3.85 5 0.5667249 0.100104600
## 3.85 10 0.5103147 -0.003326719
## 3.90 1 0.5032634 -0.015081677
## 3.90 5 0.5006993 -0.025625525
## 3.90 10 0.4434149 -0.119457309
## 3.95 1 0.5000583 -0.028432113
## 3.95 5 0.5174825 0.016282488
## 3.95 10 0.4878788 -0.059102270
## 4.00 1 0.5005828 -0.017368788
## 4.00 5 0.4999417 -0.025288981
## 4.00 10 0.5180070 0.012696126
##
## Tuning parameter 'splitrule' was held constant at a value of gini
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 2.1, splitrule = gini
## and min.node.size = 10.
Why evaluate at all? A model isn’t good because it fits the training data. It’s good if it generalizes and supports the decision you care about. Evaluation tells us if we’re actually meeting that goal.
To get an intuition about the model’s performance, we create predictions on our test data and evaluate the prediction accuracy using the confusion matrix. A confusion matrix is a performance measurement tool for classification models, including random forests. It provides a summary of the model’s predictions compared to the actual outcomes.
# Make predictions on the test set
predictions <- predict(rf_model, newdata = testData)
# Confusion matrix to evaluate performance
confusionMatrix(predictions, testData$risk_affine, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8 6
## 1 15 22
##
## Accuracy : 0.5882
## 95% CI : (0.4417, 0.7242)
## No Information Rate : 0.549
## P-Value [Acc > NIR] : 0.33808
##
## Kappa : 0.1384
##
## Mcnemar's Test P-Value : 0.08086
##
## Sensitivity : 0.7857
## Specificity : 0.3478
## Pos Pred Value : 0.5946
## Neg Pred Value : 0.5714
## Prevalence : 0.5490
## Detection Rate : 0.4314
## Detection Prevalence : 0.7255
## Balanced Accuracy : 0.5668
##
## 'Positive' Class : 1
##
What do we learn from this?
The model has an accuracy of 58.82%, which is not significantly better than random guessing. The kappa value indicates poor agreement beyond chance. The model performs reasonably well in predicting class 1 (sensitivity = 0.79) but poorly for class 0 (specificity = 0.35). The balanced accuracy suggests that overall performance is slightly above random guessing, but improvements could be made.
Overall, our model does not actually classify risk affinity very well. However, keep in mind that we have a very small dataset and very naively picked out potential determinants of risk affinity. Normally, we would have to reconsider our method or our variables at this point. Nonetheless, we will continue with this toy example to learn more about the random forest outcomes, such as variable importance and partial dependence.
# Fit the final model 'rf_model'
rf_final <- ranger(
formula = risk_affine ~ female + statistiknote_w1 + fachsemester +
erstfach_name + alkohol,
data = trainData,
num.trees = 20,
probability = TRUE, # Set probability as true
importance = "impurity", # Define the importance measure!
mtry = 2.21,
seed = 123
)
Why not only look at accuracy for model evaluation?
Accuracy counts the fraction correct at one fixed threshold (often 0.5).
It can be misleading when the classes are imbalanced, when false positives and false negatives carry different costs, or when calibrated probabilities matter more than hard labels.
What to use instead: threshold-free measures such as the ROC curve and its AUC.
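As a small illustration of the imbalance problem (simulated data, not from the survey): a degenerate classifier that always predicts the majority class looks accurate while being useless for finding positives.

```r
# Hypothetical imbalanced data: 95% negatives, 5% positives
truth <- factor(c(rep(0, 95), rep(1, 5)), levels = c("0", "1"))
# A "classifier" that always predicts the majority class
pred  <- factor(rep(0, 100), levels = c("0", "1"))

mean(pred == truth)              # accuracy = 0.95, looks great
sum(pred == "1" & truth == "1")  # true positives = 0: no positive is ever found
```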
What does the classification random forest predict?
## 0 1
## [1,] 0.2435119 0.7564881
## [2,] 0.5251001 0.4748999
## [3,] 0.4573840 0.5426160
## [4,] 0.2386508 0.7613492
## [5,] 0.2183840 0.7816160
## [6,] 0.7805474 0.2194526
\(\Rightarrow\) It predicts the probability that the outcome is \(0\) or \(1\)!
Order the predictions by the probability of the class being \(1\):
# Pick the predicted probabilities of the class 1 and the real classes
prob1 <- cbind(rf_prob_pred$predictions[,"1"], as.integer(testData$risk_affine == "1"))
prob1 <- as.data.frame(prob1)
colnames(prob1) <- c("Prediction","True Class")
# Order them by prediction
prob1_ord <- prob1[order(prob1$Prediction, decreasing = TRUE), ]
head(prob1_ord)
What should the threshold be for predicting a \(1\) rather than a \(0\)?
Idea of the ROC-AUC: The ROC curve is a threshold-free view of how well a model separates classes, showing the trade-off between catching positives (TPR) and raising false alarms (FPR).
Let’s say we choose a threshold of \(0.84\). \(\Rightarrow\) Then probabilities of \(\geq 0.84\) predict the class \(1\), and \(0\) otherwise.
example <- head(prob1_ord)
example_84 <- cbind(example, c(1,1,0,0,0,0))
colnames(example_84) <- c("Prediction","True Class","Predicted Class (Threshold 0.84)")
example_84
Now we compute true positives (TP) and false negatives (FN):
Rates at this threshold (provides one point on the ROC curve):
At threshold 0.84 the ROC point is (FPR = 0.33, TPR = 0.33): we catch about one-third of positives while falsely flagging one-third of negatives. However, a single cutoff (e.g., 0.5 or 0.84) can be misleading under class imbalance, so the ROC considers every threshold.
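The computation behind a single ROC point can be sketched as follows. The helper roc_point and the probabilities are hypothetical, chosen so that the 0.84 threshold reproduces the rates discussed above:

```r
# TPR and FPR at a given probability threshold (illustrative helper)
roc_point <- function(prob, truth, threshold) {
  pred <- as.integer(prob >= threshold)
  tpr <- sum(pred == 1 & truth == 1) / sum(truth == 1)  # true positive rate
  fpr <- sum(pred == 1 & truth == 0) / sum(truth == 0)  # false positive rate
  c(TPR = tpr, FPR = fpr)
}

# Hypothetical predicted probabilities and true classes
prob  <- c(0.9, 0.85, 0.8, 0.7, 0.4, 0.2)
truth <- c(1,   0,    1,   1,   0,   0)
roc_point(prob, truth, threshold = 0.84)  # TPR = 1/3, FPR = 1/3
```

Sweeping the threshold from 1 down to 0 and collecting these (FPR, TPR) pairs traces out the full ROC curve.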
\(\Rightarrow\) The ROC curve plots the TPR (y-axis) against the FPR (x-axis) for all thresholds!
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
# 1) Probabilities on the test set
rf_prob_pred <- predict(rf_final, data = testData)
rf_prob <- rf_prob_pred$predictions[,"1"]
# Logistic regression baseline (same features)
logit_fit <- glm(risk_affine ~ female + statistiknote_w1 + fachsemester +
erstfach_name + alkohol,
data = trainData, family = binomial)
logit_prob <- predict(logit_fit, newdata = testData, type = "response")
test_y <- factor(testData$risk_affine, levels = c("0","1"))
# 2) ROC curves + AUC
roc_rf <- roc(response = test_y, predictor = rf_prob, levels = c("0","1"), direction = "<")
roc_log <- roc(response = test_y, predictor = logit_prob, levels = c("0","1"), direction = "<")
auc_rf <- auc(roc_rf)
auc_log <- auc(roc_log)
# 3) Plot ROC for both models
plot(roc_rf, col = "#1f77b4", lwd = 2, main = "ROC: Random Forest vs Logistic",
legacy.axes = TRUE, print.auc = FALSE)
plot(roc_log, col = "#ff7f0e", lwd = 2, add = TRUE)
legend("bottomright",
legend = c(paste0("Random Forest (AUC = ", round(auc_rf, 3), ")"),
paste0("Logistic Reg. (AUC = ", round(auc_log, 3), ")")),
col = c("#1f77b4", "#ff7f0e"), lwd = 2, bty = "n")
AUC is the area under the ROC curve: a value of 0.5 corresponds to a model that ranks no better than chance, and 1.0 to perfect separation of the classes.
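Equivalently, the AUC is the probability that a randomly chosen positive receives a higher predicted probability than a randomly chosen negative. This identity can be checked by hand on hypothetical scores (the helper auc_manual is ours, not part of pROC):

```r
# AUC as the probability that a random positive outranks a random negative
auc_manual <- function(prob, truth) {
  pos <- prob[truth == 1]
  neg <- prob[truth == 0]
  pairs <- expand.grid(pos = pos, neg = neg)  # all positive/negative pairs
  mean(pairs$pos > pairs$neg) + 0.5 * mean(pairs$pos == pairs$neg)  # ties count half
}

# Hypothetical scores: 7 of the 9 positive/negative pairs are ranked correctly
prob  <- c(0.9, 0.75, 0.8, 0.3, 0.4, 0.2)
truth <- c(1,   0,    1,   1,   0,   0)
auc_manual(prob, truth)  # 7/9 ~ 0.778
```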
Task: train a random forest using statistiknote_w1 as the outcome and female, risk_affine, fachsemester, erstfach_name, alkohol as the potential predictors. Use 20 trees and find the best value of mtry! How well does our regression random forest perform?
Variable importance is a key feature in random forest models,
providing insights into how different predictors contribute to the
model’s predictions. In random forests, which consist of many decision
trees, variable importance helps identify which variables are most
influential in making accurate predictions. The intuition is that the
more often a specific variable is used for a tree split, the more
important the variable is for the overall prediction. We use the
vip package to examine variable importance in our
example.
##
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
##
## vi
We create a plain plot by only using our model as a function argument
of the vip() function.
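A minimal sketch of this pattern, shown self-contained on synthetic data (in our session the same call would be applied to rf_final, storing the plot so its underlying data can be reused):

```r
library(ranger)
library(vip)

# Self-contained sketch: vip() takes the fitted model and returns a
# ggplot bar chart of the stored importance values
set.seed(123)
toy <- data.frame(y = factor(rbinom(200, 1, 0.5)),
                  a = rnorm(200), b = rnorm(200))
fit <- ranger(y ~ a + b, data = toy, importance = "impurity")
v1 <- vip(fit)  # plain importance plot
v1$data         # the importance values behind the plot
```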
You can access the calculated variable importance measures.
Using the explicit data above also allows us to use a ggplot2 barplot to visualize the variable importance.
# Example plot using ggplot2
v1_table <- v1$data
ggplot(v1_table, aes(x = reorder(Variable, Importance), y = Importance)) +
geom_bar(stat = "identity", position = "dodge", fill = "#4BADA9") +
coord_flip() +
scale_x_discrete(labels=c("statistiknote_w1"="Expected exam grade",
"erstfach_name"="Major",
"alkohol"="Alcohol consumption",
"fachsemester"="Semester",
"female"="Female"))+
labs(title = "", x = " ", y = "Importance") +
theme_minimal() +
theme(legend.position = "bottom")+
theme(axis.text=element_text(size=16),axis.title=element_text(size=16),
legend.text=element_text(size=12),
legend.title = element_text(size = 12))
Partial Dependence (PD) is a technique used to interpret machine learning models, particularly ensemble methods like random forests. It helps in understanding the relationship between a specific predictor and the predicted outcome, while accounting for the average effect of other predictors in the model.
Partial Dependence Plots (PDPs) are graphical representations of this relationship. A PDP illustrates how the predicted outcome changes as the value of one or two predictors varies, while averaging out the effects of all other predictors. The PDPs provide a clear and intuitive way to interpret complex models. They help in diagnosing the nature of relationships (linear, nonlinear) between predictors and the target variable, and in identifying interaction effects between predictors.
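The averaging idea can be written out by hand. The sketch below (a simplified illustration of the algorithm, not the internals of the pdp package; the helper name manual_pdp is ours) fixes the predictor of interest at each grid value for every observation, predicts, and averages. As a sanity check, for a linear model the resulting curve recovers the coefficient of that predictor as its slope.

```r
# Hand-rolled partial dependence for one numeric predictor
# (assumes the model has a predict() method with a newdata argument)
manual_pdp <- function(model, data, var, grid) {
  sapply(grid, function(v) {
    data_mod <- data
    data_mod[[var]] <- v                        # fix the predictor at the grid value
    mean(predict(model, newdata = data_mod))    # average prediction over all rows
  })
}

# Toy check with a linear model: the PDP of x has slope equal to its coefficient
set.seed(123)
d <- data.frame(x = rnorm(100), z = rnorm(100))
d$y <- 2 * d$x + d$z + rnorm(100, sd = 0.1)
fit <- lm(y ~ x + z, data = d)
pd <- manual_pdp(fit, d, "x", grid = c(-1, 0, 1))
pd  # roughly linear in the grid values with slope ~2
```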
In our case, we are working with a ranger model that classifies
whether a student is risk-affine or not, and we want to visualize the
effect of individual features like female,
statistiknote_w1, fachsemester,
erstfach_name, and alkohol on the model’s
predictions.
We use the pdp package to calculate partial
dependence.
##
## Attaching package: 'pdp'
## The following object is masked from 'package:purrr':
##
## partial
# Calculate the Partial Dependence for fachsemester
pdp_fachsemester <- partial(
object = rf_final,
pred.var = "fachsemester",
train = trainData,
prob = TRUE
)
# Plot the PDP for fachsemester
pdp_plot_fachsemester <- autoplot(pdp_fachsemester) +
ggtitle("Partial Dependence of Risk Affinity on Fachsemester") +
ylab("Probability of being Risk Affine")+
xlab("Fachsemester")+
theme_minimal()
# Show the plot
pdp_plot_fachsemester
Interpretation: the x-axis shows the values of the chosen predictor, and the y-axis the average predicted probability of being risk affine. A PDP for statistiknote_w1, for example, would have the possible grades along the x-axis; if the curve increases in statistiknote_w1, it indicates that higher grades are associated with a higher probability of being risk-affine.
Next, we look at the partial dependence of all predictor variables and combine everything into one plot.
# Calculate the Partial Dependence for statistiknote_w1
pdp_statistiknote <- partial(
object = rf_final,
pred.var = "statistiknote_w1",
train = trainData,
prob = TRUE # Since we're predicting probabilities
)
# Calculate the Partial Dependence for female
pdp_female <- partial(
object = rf_final,
pred.var = "female",
train = trainData,
prob = TRUE
)
# Calculate the Partial Dependence for fachsemester
pdp_fachsemester <- partial(
object = rf_final,
pred.var = "fachsemester",
train = trainData,
prob = TRUE
)
# Calculate the Partial Dependence for erstfach_name
pdp_erstfach <- partial(
object = rf_final,
pred.var = "erstfach_name",
train = trainData,
prob = TRUE
)
pdp_erstfach$erstfach_name <- as.factor(pdp_erstfach$erstfach_name)
# Calculate the Partial Dependence for alkohol
pdp_alkohol <- partial(
object = rf_final,
pred.var = "alkohol",
train = trainData,
prob = TRUE
)
pdp_alkohol$alkohol <- as.factor(pdp_alkohol$alkohol)
# Plot the PDP for fachsemester
p1 <- autoplot(pdp_fachsemester) +
ylab("Prob. of being Risk Affine")+
xlab("Number of Semesters")+
theme_minimal()
# Plot the PDP for erstfach_name
p2 <- autoplot(pdp_erstfach) +
ylab("Prob. of being Risk Affine")+
xlab("Major")+
theme_minimal()
# Plot the PDP for female
p3 <- autoplot(pdp_female) +
ylab("Prob. of being Risk Affine")+
xlab("Female")+
theme_minimal()
# Plot the PDP for alkohol
p4 <- autoplot(pdp_alkohol) +
ylab("Prob. of being Risk Affine")+
xlab("Alkohol Consumption")+
theme_minimal()
# Plot the PDP for statistiknote_w1
p5 <- autoplot(pdp_statistiknote) +
ylab("Prob. of being Risk Affine")+
xlab("Expected Exam Grade")+
theme_minimal()
# Combine the plots
(p1 | p2 ) /
(p3 | p4 ) /
( p5)